Reliability Models for Multiprocessor Systems
With and Without Periodic Maintenance

Abstract 

Multiprocessor systems, although designed for speed and processing power, lend
themselves inherently to redundancy. Appropriately designed distributed intelligence
systems that utilize system reconfiguration and graceful degradation can be
substantially more reliable than uniprocessor systems. Reliability models for two
multiprocessor systems, C.mmp and Cm*, are presented and compared to a single
LSI-11 processor.

With the exception of spaceborne systems, most systems may be subjected to
tests to ensure proper functioning. When performed regularly, these integrity checks
enhance confidence in the system and increase its expected mean time to failure. The
effect of such periodic maintenance is modeled. The expected life is seen to depend
strongly on the efficiency of the tests. The improvement in expected life, however, is
observed to be limited by the non-redundant parts of a system. Under periodic
maintenance, the Cm* system offers a greater life than C.mmp for tasks allowing
considerable redundancy.
1. Reliability Considerations

1.1.  Introduction 

Interest in highly reliable systems has increased dramatically in the last five
years. This can be attributed to at least three trends:

Digital systems have been increasingly applied in areas where a
failure can lead to catastrophic consequences.

System complexity has increased as system performance and
capability have increased. Increased complexity in a nonredundant
system means a less reliable system.

Decreased  hardware  cost  has  expanded  the  application  areas  of 
digital  systems.  These  new  application  areas  require  increased 
unattended  system  reliability  since  the  users  are  less  sophisticated 
and  the  cost  of  repair  personnel  easily  dominates  the  system  cost. 

In  order  to  evaluate  and  compare  systems  an  accurate  reliability  model  is 
essential. 

It is a common practice in reliability modeling to divide a system under
investigation into a number of subsystems or modules. A judicious partitioning leads to
a set of modules that are statistically mutually independent. The reliability of a
nonredundant system is then merely the product of the reliabilities of the various modules.

The problem that still remains is that of finding the reliability of individual
modules. Historically it was an accepted practice to assume statistical
independence at the gate level, and to raise the gate reliability to the power of the
number of gates in the module. The present day technology of large scale integration
renders this technique obsolete. Although reliability is still a function of the
complexity, the complexity may no longer be treated as a simple function of the number
of gates. The following subsection outlines the currently acceptable approach. Section 2
presents a comparison of two multiprocessor systems: C.mmp and Cm*. Finally, section 3
discusses the reliability of systems employing periodic maintenance.

1.2.  Parts  Count  Model 

Before presenting the parts count model, let us review its basic assumptions. It
will be assumed that the system is constructed of printed circuit boards. The PC boards
hold IC chips that are assumed to be statistically independent. It is further assumed
that the reliability of a single module is exponentially distributed or, equivalently,
that the failures of a single chip follow the Poisson distribution with failure rate
$\lambda$. In other words,

$$ \text{Probability of } k \text{ failures in time interval } (0,t) = e^{-\lambda t} (\lambda t)^k / k! \tag{1.1} $$

$$ \text{Reliability} = \text{probability of no failures in } (0,t) = e^{-\lambda t} \tag{1.2} $$

With these assumptions, if a system does not contain any redundancy (i.e. every
subsystem must function properly for the system to work), the system reliability is
also exponential in nature. Furthermore, the failure rate of the system is the sum of
the failure rates of the individual modules.
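The relations (1.1) and (1.2), together with the additive rule for a nonredundant system, can be checked numerically. The following is a minimal sketch only; the function names and the three sample failure rates are hypothetical and are not taken from the handbook.

```python
import math

def poisson_failures(k, lam, t):
    """Probability of exactly k failures in (0, t) for failure rate lam (eq. 1.1)."""
    return math.exp(-lam * t) * (lam * t) ** k / math.factorial(k)

def reliability(lam, t):
    """Probability of no failure in (0, t) (eq. 1.2)."""
    return math.exp(-lam * t)

def series_failure_rate(rates):
    """Failure rate of a nonredundant system: the sum of its module rates."""
    return sum(rates)

# Hypothetical module failure rates, in failures per hour.
rates = [2e-6, 5e-6, 3e-6]
t = 1000.0

# Product of module reliabilities versus one exponential with the summed rate.
product_of_reliabilities = math.prod(reliability(lam, t) for lam in rates)
system_reliability = reliability(series_failure_rate(rates), t)
```

Both quantities agree up to rounding, which is the statement that the product of exponentials is an exponential whose rate is the sum of the module rates.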

To estimate the failure rates of chips we will use the data published in the
Military Standardization Handbook 217B [Mil74]. The handbook suggests the following
model for the failure rate of a single chip.

$$ \lambda = \pi_L \pi_Q ( a_1 \pi_T G^{\alpha} + a_2 \pi_E G^{\beta} ) \tag{1.3} $$

where $a_1$, $a_2$, $\alpha$, $\beta$ and the $\pi$'s are various constants,
G is the number of gates in the chip.

The constants $\pi_L$, $\pi_E$, $\pi_T$, and $\pi_Q$ depend on, respectively, learning
(experience in using the particular chip), environment (whether the system is ground
based, fixed, spaceborne, etc.), expected junction temperature, and quality control (how
rigorously the chip has been subjected to tests and burn-in). Table 1.1 shows the
failure rates for chips with various numbers of gates. The system is assumed to be
fixed, ground based; the quality control factor is assumed to be 10 (which is in between
the MIL-STD factor of 1.0 and the factor to be used for minimal industrial quality
control, 150), and the junction temperature is assumed to be 50°C. The other constants,
$a_1$, $a_2$, $\alpha$ and $\beta$, are fixed and equal to 0.00129, 0.00389, ...67 and
0.35 respectively.

Table 1.1

Gates   Failures in      Gates   Failures in
        10^6 hours               10^6 hours

  1     0.04342            9     0.10559
  2     0.05711           10     0.11037
  3     0.06721           11     0.11490
  4     0.07553           12     0.11920
  5     0.08275           13     0.12332
  6     0.08920           14     0.12728
  7     0.09508           15     0.13108
  8     0.10052           16     0.13475

The  estimation  of  failure  rate  for  a module  is  best  described  by  an  example  in  the  next 
section. 

An interesting phenomenon can be observed if we plot the failure rate per gate
versus the number of gates per chip. As semiconductor components get larger they
also become more reliable per function, up to a point of diminishing returns. Figure
1.1 depicts the failure rate per million hours per gate as a function of the number of
gates on a chip. The curves marked 65a, 65b were derived from data in [Mil65] circa
1965 while the curve marked 74 was derived from [Mil74] circa 1974. Two trends can
be noted. In the 1974 data, gate functions exhibit as much as an order of magnitude
decrease in failure rate up to a density of approximately 100 gates. Prior to the 100
gate minimum, packaging and lead failures dominate. Beyond that, gate failure rate
increases due to immaturity of the fabrication process.

The 1965 data was incomplete due to the newness of integrated circuits. One
study (curve 65a) showed that a failure rate of 0.4 per 10^6 hours was a good
approximation for state of the art ICs at that time (one to four gates per IC). Another
study examined small functional units composed of discrete components and ICs.
Various ten element units showed failure rates of 0.S3 to 1.8 per 10^6 hours (curve
65b). While the data is incomplete it is reasonable to assume that gate functions
become more reliable with time. In a like manner, the point of minimum failure rate per
gate can be assumed to be moving to the right as technology matures. Thus
constructing systems from larger components can lead to significantly more reliable
systems.

1.3.  Example 


As an example let us consider the Processor Interface module in C.mmp. Figure
1.2 shows the chip lay-out and list of parts for the Processor Interface. From this data
we form Table 1.2.


Table 1.2  Calculation of $\lambda$ for Processor Interface

No.    Id        Gates    $\lambda$

  1    74S140      4      0.07553
  1    7440        4      0.07553
  1    7404        6      0.08920
  2    74S138     16      0.13475
  6    7438        4      0.07553
 13    74S74      10      0.11037

From the individual $\lambda$'s in the table, we estimate $\lambda$ for the Processor Interface to be

$$ \lambda = 0.07553 + 0.07553 + 0.08920 + 2(0.13475) + 6(0.07553) + 13(0.11037) = 2.39775 \text{ failures}/10^6 \text{ hours}. $$

Thus  we  arrive  at  the  failure  rate  for  the  Processor  Interface  board.  Similar 
calculations  have  been  carried  out  for  all  subsystems  of  C.mmp  to  yield  the  overall 
system  failure  rate. 
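The parts count calculation for the Processor Interface can be reproduced in a few lines. This is a minimal sketch; the per-chip rates are copied from Table 1.1 and the quantities from Table 1.2, while the names `RATE_BY_GATES`, `PROCESSOR_INTERFACE` and `module_failure_rate` are our own.

```python
# Failure rates (failures per 10^6 hours) by gate count, copied from Table 1.1.
RATE_BY_GATES = {4: 0.07553, 6: 0.08920, 10: 0.11037, 16: 0.13475}

# (quantity, gates per chip) for each row of Table 1.2.
PROCESSOR_INTERFACE = [(1, 4), (1, 4), (1, 6), (2, 16), (6, 4), (13, 10)]

def module_failure_rate(parts, rate_by_gates):
    """Parts count estimate: sum of quantity times per-chip failure rate."""
    return sum(qty * rate_by_gates[gates] for qty, gates in parts)

rate = module_failure_rate(PROCESSOR_INTERFACE, RATE_BY_GATES)
print(f"{rate:.5f} failures per 10^6 hours")  # prints 2.39775 failures per 10^6 hours
```

The same function applies to any other board once its parts list has been tabulated by gate count.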

The following section will compare the reliabilities of both C.mmp and Cm* (Computer
Modules) in both non-redundant and redundant configurations.
2. Reliability Comparison of C.mmp and Cm*


Parts count reliability models were developed for both of the in-house systems
at CMU, the multiminiprocessor C.mmp and the computer modules system Cm*. As the
conceptual diagrams of Figure 2.1 depict, both are general purpose multiprocessor
systems. C.mmp is a general purpose system with a fixed architecture. Up to 16
PDP-11 processors (Pc) can communicate with up to 16 shared memory ports (Mp) through
a crosspoint switch (Smp) (Figure 2.1a). Cm*, on the other hand, has a flexible
architecture that may be so modified as to afford optimal performance for a given
application. Cm* is a modular, multi-micro-processor system based on the LSI-11
processors as depicted in Figure 2.1b. Each Computer Module (Cm) is connected via an
interface (S.local) to an intelligent cluster controller, K.map. The clusters can be
interconnected via L.inc's. Each Cm can share memory with any other Cm in the
network through routing tables in the K.map.

As multiprocessor systems, both C.mmp and Cm* offer potential processing
power well beyond that of a single processor. C.mmp has an upper limit of 16
processors (since the existing switch has only 16 ports for processors). In concept,
the Cm* architecture is arbitrarily extendible; the only limiting factors are the cost and
fundamental limits of the programmed algorithms. When all of the processing power is
not required or graceful degradation of processing is tolerable, it is possible to
conceive of either C.mmp or Cm* as a potentially redundant architecture. If a task
requires the minimal processing capability of, for example, only four processors, then
we may view the other processors as stand-by spares or expendables. Assuming that
we can detect and locate a faulty component (processor, memory, switch), and that the
malfunction was not an irrecoverable one, we can then logically replace a faulty
component with a stand-by spare or simply exclude it from the system. The structures
are thus fault-tolerant and have greater reliability.

To arrive at the reliabilities of multiprocessor fault-tolerant systems, we need to
use two levels of modeling. We apply the parts count reliability model to estimate the
failure rates of individual modules. Then using the reliabilities of these non-redundant
modules, we model the fault-tolerant system to arrive at a system reliability.

2.1.  Parts  Count  Reliability  Model 

The failure rates for standard IC chips are found in the Military Standardization
Handbook (MIL-STD-HDBK-217B). Assuming exponential distribution for reliability and
mutual statistical independence, the failure rate of each module is estimated as the sum
of the failure rates of its various components. Since the handbook also predicts the
failure rates for such components as resistors or printed boards, completeness of the
model is assured. The following are the failure rates for various modules of the C.mmp
and Cm* systems using the parts count reliability model.

Component                                   Failure rate
                                            (failures per 10^6 hrs.)

C.mmp
  PDP-11/40                                  57.496
  Processor associated circuitry             11.414
    (RFLOC, processor interface)
  Memory box (16K words; core)               54.225
  Memory associated circuitry                 7.14
    (Priority decode, etc.) per port
  Switch                                    202.403

Cm*
  LSI-11 processor                          109.0
  Memory (12K words; semiconductor)         203.343
  K.map                                     178.414
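Under the exponential assumption, each of these failure rates translates directly into a mean time to failure and a reliability curve. The sketch below is illustrative only; it copies a few rates from the table above, and the function names are hypothetical.

```python
import math

# Module failure rates in failures per 10^6 hours, copied from the table above.
FAILURE_RATES = {
    "PDP-11/40": 57.496,
    "LSI-11 processor": 109.0,
    "K.map": 178.414,
    "C.mmp switch": 202.403,
}

def mttf_hours(rate_per_million_hours):
    """Mean time to failure in hours for an exponentially distributed life."""
    return 1.0e6 / rate_per_million_hours

def reliability(rate_per_million_hours, hours):
    """Probability of surviving the given number of hours without failure."""
    return math.exp(-rate_per_million_hours * hours / 1.0e6)

for name, rate in FAILURE_RATES.items():
    print(f"{name}: MTTF = {mttf_hours(rate):.1f} h, R(1000 h) = {reliability(rate, 1000.0):.4f}")
```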
2.2. Redundancy Model for C.mmp


In the absence of data on fault detection/propagation and module replacement
capabilities in multiprocessor systems, we use the following simplistic model (giving us
the upper bound on potential reliability). If there are N identical components with the
reliability of each component $R_0$ ($R_0 = e^{-\lambda t}$, where $\lambda$ = failure rate), and if a task
requires k components, the subsystem can tolerate up to N-k failures, and the reliability
of such a subsystem is

$$ \sum_{i=0}^{N-k} \binom{N}{i} R_0^{N-i} (1 - R_0)^i \tag{2.1} $$
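The k-out-of-N model just described can be sketched as a sum of binomial terms, one for each tolerated number of failures. The function name `k_of_n_reliability` is our own and the numerical values are illustrative.

```python
import math

def k_of_n_reliability(n, k, r0):
    """Probability that at least k of n identical, independent components
    (each with reliability r0) are still working, i.e. at most n - k failed."""
    return sum(
        math.comb(n, i) * r0 ** (n - i) * (1.0 - r0) ** i
        for i in range(n - k + 1)
    )

# With no spares (k = n) the subsystem is a plain series system.
series = k_of_n_reliability(16, 16, 0.95)
# Requiring only 4 of 16 leaves 12 components as expendables.
redundant = k_of_n_reliability(16, 4, 0.95)
```

Since perfect detection and replacement are assumed, the result is an upper bound in the sense stated in the text.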

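Thus the reliability of C.mmp with 16 processors and 16 64K-memories, with at
least four processors and four memory ports required for the task, is

$$ R = R_s \left( \sum_{i=0}^{12} \binom{16}{i} R_p^{16-i} (1 - R_p)^i \right) \left( \sum_{i=0}^{12} \binom{16}{i} R_m^{16-i} (1 - R_m)^i \right) \tag{2.2} $$

where $R_s$ = switch reliability = $e^{-\lambda_s t}$
$R_p$ = (processor + associated circuitry) reliability = $e^{-\lambda_p t}$
$R_m$ = (memory + associated circuitry) reliability = $e^{-\lambda_m t}$
and $\lambda_s$, $\lambda_p$, $\lambda_m$ are the corresponding failure rates.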

Figure 2.2 shows the reliabilities of a 16-processor C.mmp system as a function
of time for tasks requiring various numbers of processors. The plot for task
processors = 16 is the reliability of a totally non-redundant C.mmp. The dramatic
increase in reliability as the number of task processors decreases is evident.
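A rough numerical counterpart of this comparison can be sketched as follows. The failure rates are taken from the parts count table of Section 2.1, but the memory figure is an assumption on our part: we take one 16K memory box plus its associated circuitry per port, which may not match the memory configuration used for the published curves. All names are hypothetical.

```python
import math

def at_least(n, k, r):
    """Probability that at least k of n independent units of reliability r work."""
    return sum(math.comb(n, i) * r ** (n - i) * (1 - r) ** i for i in range(n - k + 1))

# Failure rates in failures per 10^6 hours (Section 2.1).
LAMBDA_SWITCH = 202.403
LAMBDA_PROC = 57.496 + 11.414   # PDP-11/40 plus processor associated circuitry
LAMBDA_MEM = 54.225 + 7.14      # ASSUMPTION: one 16K memory box per port plus its circuitry

def cmmp_reliability(hours, k_proc, k_mem, n=16):
    """Switch in series with k_proc-of-n processors and k_mem-of-n memory ports."""
    def r(lam):
        return math.exp(-lam * hours / 1.0e6)
    return r(LAMBDA_SWITCH) * at_least(n, k_proc, r(LAMBDA_PROC)) * at_least(n, k_mem, r(LAMBDA_MEM))

for k in (4, 8, 16):
    print(k, cmmp_reliability(1000.0, k, k))
```

With k = 16 the expression collapses to a single exponential whose rate is the sum of all module rates, which is the totally non-redundant case.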


2.3. Redundancy Model for Cm*


For  the  Cm*  system,  we  will  present  a series  of  system  reliability  models,  each 
one  more  accurate  than  the  preceding  ones.  By  following  the  stepwise  refinement,  the 
reader  will  understand  the  origin  of  each  term  in  the  final  equation.  Modeling  will  be 
performed  at  the  PMS  (processor,  memory,  switch)  level.  The  components  and  their 
failure  effects  are  listed  below. 
Component            Effect of component failure

LSI-11 processor     Loss of processor, not its associated memory
4K memory            Loss of memory
S.local              Loss of processor and its associated memory
K.map                Loss of cluster
L.inc                Loss of L.inc, possible reduction in processing power
                     and memory capacity


Generally, the above failure effects are pessimistic. However, there are a small
number of potential failures that are more severe than indicated. For example, a
processor failure might short a bus control line, thus disabling the bus and making the
local memory inaccessible. The number of such failures is small and their effects will
be ignored for the current development.

Consider a single cluster with N LSI-11's. If K are required for a task then the
reliability is:

$$ R_{sys} = R_{km} \left\{ \sum_{i=0}^{N-K} \binom{N}{i} (R_p R_m R_{sl})^{N-i} (1 - R_p R_m R_{sl})^i \right\} \tag{2.3} $$

where $R_{km}$ = reliability of the K.map
$R_p$ = reliability of the processor
$R_{sl}$ = reliability of S.local
$R_m$ = reliability of the 12K memory.

Figure 2.3 represents the reliability for N = 8 and K = 4, 6 and 8. The equation
above represents a best case model in that perfect recovery is assumed. To model
imperfect recovery a factor called coverage [BourW71] is introduced. Coverage, C, is
the conditional probability that the system recovers successfully given that there was
a failure. Assuming the system fails the first time recovery fails or component
exhaustion occurs, the system reliability becomes:

$$ R_{sys} = R_{km} \left\{ \sum_{i=0}^{N-K} \binom{N}{i} (R_p R_m R_{sl})^{N-i} (1 - R_p R_m R_{sl})^i C^i \right\} \tag{2.4} $$
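The single cluster model, with and without perfect coverage, can be sketched as below. The function name and the sample reliabilities are hypothetical; `r_cm` stands for the product of the processor, memory and S.local reliabilities of one Computer Module.

```python
import math

def cluster_reliability(n, k, r_kmap, r_cm, coverage=1.0):
    """Reliability of one cluster of n Computer Modules when k are required.

    Each of the i tolerated failures must also be recovered from, which
    happens with probability `coverage`; coverage = 1.0 is perfect recovery.
    """
    return r_kmap * sum(
        math.comb(n, i) * r_cm ** (n - i) * (1.0 - r_cm) ** i * coverage ** i
        for i in range(n - k + 1)
    )

# Hypothetical reliabilities at some fixed time t.
r_kmap, r_cm = 0.95, 0.90
perfect = cluster_reliability(8, 4, r_kmap, r_cm)
imperfect = cluster_reliability(8, 4, r_kmap, r_cm, coverage=0.9)
```

With coverage equal to zero only the failure-free term survives, so the cluster behaves like a non-redundant series system regardless of the number of spares.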
The effect of nonperfect coverage is shown in Figure 2.4. The actual value of
the parameter C will be derived from a study of the error detection/recovery features
of the Cm* hardware, and is beyond the scope of this paper.

By varying the network topology and requirements, we can vary the resultant
system reliability. Consider the two cluster network in Figure 2.1b. Each cluster has
eight Cm's and each Cm has 12K of memory in addition to the 4K on board the
processor. Assume that at least K processors and l 4K memory modules must function
for the network to be performing its task. To assess the reliability we list the
following states and the corresponding probabilities.


(i) Both K.maps and L.inc good, $R_l R_{km}^2 R_{2k}$

(ii) One K.map fails, L.inc good, $2 R_l R_{km} (1 - R_{km}) R_{1k} C_{km}$

(iii) One K.map fails, L.inc fails, $2 (1 - R_l) R_{km} (1 - R_{km}) R_{1k} C_{km} C_l$

(iv) L.inc fails, both K.maps good, $(1 - R_l) R_{km}^2 (2 R_{1k} - R_{1k}^2) C_l$

where $R_l$ = L.inc reliability
$R_{km}$ = K.map reliability
$R_{2k}$ = reliability of two clusters such that the
number of processors is at least K and the
number of memories is at least l
$R_{1k}$ = same as $R_{2k}$ for one cluster
$C_{km}$ = coverage factor for K.map
$C_l$ = coverage factor for L.inc.

Thus the system reliability is given by summing the above states

$$ R_{sys} = R_l R_{km}^2 R_{2k} + 2 R_l R_{km} (1 - R_{km}) R_{1k} C_{km} + 2 (1 - R_l) R_{km} (1 - R_{km}) R_{1k} C_{km} C_l + (1 - R_l) R_{km}^2 (2 R_{1k} - R_{1k}^2) C_l \tag{2.5} $$
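The sum over the four K.map and L.inc states can be written out term by term. This is a minimal sketch with hypothetical argument names; the one cluster and two cluster terms are passed in as plain numbers.

```python
def two_cluster_reliability(r_l, r_km, r_1k, r_2k, c_km=1.0, c_l=1.0):
    """Two cluster network reliability as a sum over K.map / L.inc states.

    r_l, r_km  : L.inc and K.map reliabilities
    r_1k, r_2k : probability that one cluster, respectively both clusters
                 together, still meet the processor and memory requirements
    c_km, c_l  : coverage factors for K.map and L.inc failures
    """
    both_good = r_l * r_km ** 2 * r_2k
    one_kmap_down = 2 * r_l * r_km * (1 - r_km) * r_1k * c_km
    kmap_and_linc_down = 2 * (1 - r_l) * r_km * (1 - r_km) * r_1k * c_km * c_l
    linc_down = (1 - r_l) * r_km ** 2 * (2 * r_1k - r_1k ** 2) * c_l
    return both_good + one_kmap_down + kmap_and_linc_down + linc_down
```

The factor `2 * r_1k - r_1k ** 2` in the last term is the probability that at least one of two isolated clusters can carry the task on its own.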


We will now derive the one cluster, $R_{1k}$, and the two cluster, $R_{2k}$, terms.


For $R_{1k}$ the system fails if there are fewer than K processors or fewer than l
memories. A processor can be denied to the system through a processor or S.local
failure. Similarly, an S.local failure can deny the associated memories to the system.
Therefore,

$$ R_{1k} = \sum_{i=0}^{8-K} \binom{8}{i} C_{sl}^i R_{sl}^{8-i} (1 - R_{sl})^i R_{proc,i} R_{mem,i} \tag{2.6} $$

where

$R_{sl}$ = S.local reliability
$C_{sl}$ = coverage factor for S.local
$R_{proc,i}$ = aggregate reliability of processors with
working S.locals
$R_{mem,i}$ = aggregate reliability of memories with
working S.locals

$$ R_{proc,i} = \sum_{j=0}^{8-i-K} \binom{8-i}{j} C_p^j R_p^{8-i-j} (1 - R_p)^j \tag{2.7} $$

where $C_p$ = coverage factor for processor.

The equation above indicates that the system only works if at least K
processors, whose associated S.locals are functioning correctly, are nonfailed. If we
assume that the reliability of the 4K memory on the processor board is the same as
the other 4K memory and that processor and on-board memory failures are
independent, then

$$ R_{mem,i} = \sum_{n=0}^{32-4i-l} \binom{32-4i}{n} C_m^n R_m^{32-4i-n} (1 - R_m)^n \tag{2.8} $$

where $C_m$ = coverage factor for one 4K memory.

By analogy,

$$ R_{2k} = \sum_{i=0}^{16-K} \binom{16}{i} C_{sl}^i R_{sl}^{16-i} (1 - R_{sl})^i \left\{ \left( \sum_{j=0}^{16-i-K} \binom{16-i}{j} C_p^j R_p^{16-i-j} (1 - R_p)^j \right) \left( \sum_{n=0}^{64-4i-l} \binom{64-4i}{n} C_m^n R_m^{64-4i-n} (1 - R_m)^n \right) \right\} \tag{2.9} $$
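The nested sums for the one cluster and two cluster terms can be sketched with a single routine that takes the number of Computer Modules as a parameter (8 for one cluster, 16 for two). This is an illustrative sketch under our own assumptions: the summation limits are chosen so that at least K processors and l 4K memory modules remain, each Cm is taken to carry four 4K memory modules, and all names are hypothetical.

```python
import math

def survive(n, need, r, c=1.0):
    """P(at least `need` of n units work), each failure weighted by coverage c."""
    if need > n:
        return 0.0
    return sum(
        math.comb(n, j) * c ** j * r ** (n - j) * (1.0 - r) ** j
        for j in range(n - need + 1)
    )

def clusters_reliability(n_cm, k, l, r_sl, r_p, r_m,
                         c_sl=1.0, c_p=1.0, c_m=1.0, mem_per_cm=4):
    """Probability that at least k processors and l 4K memories are usable.

    i counts failed S.locals; a failed S.local removes its processor and
    its mem_per_cm memory modules from the pool.
    """
    total = 0.0
    for i in range(n_cm + 1):
        alive = n_cm - i
        total += (
            math.comb(n_cm, i) * c_sl ** i * r_sl ** alive * (1.0 - r_sl) ** i
            * survive(alive, k, r_p, c_p)
            * survive(mem_per_cm * alive, l, r_m, c_m)
        )
    return total

r_1k = clusters_reliability(8, 4, 8, r_sl=0.98, r_p=0.95, r_m=0.97)
r_2k = clusters_reliability(16, 4, 8, r_sl=0.98, r_p=0.95, r_m=0.97)
```

Terms in which too few modules remain contribute zero through `survive`, so looping over every possible number of failed S.locals gives the same value as truncating the outer sum.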

The $R_{sys}$ as calculated from these equations is plotted in Figure 2.5 with various
values of K and l, and with all coverage factors assumed to be one. The model can be
extended in an obvious manner for a larger number of clusters and/or Cm's. Based on the
procedure outlined above, a program is being developed that takes any general PMS
structure and minimal component requirements as input, and provides the system
reliability as output.
Frequently redundant systems are compared via mission time improvement, MTI
[KnoxJ64]. If $t_{mi}$ is the time for which the $R_{sys}$ of system i is above a certain minimum
mission reliability, then $t_{mi}$ is called the mission time, and $t_{m1}/t_{m2}$ is the MTI of system
one over system two. To compare the single cluster Cm* network against a
non-redundant LSI-11 processor, we solve the equation:

$$ R_{sys}(t_1) = R_{LSI-11}(t_2) \tag{2.10} $$

Since the MTI is not constant over all values of $R_{LSI-11}$ we plot it as a function
of $R_{LSI-11}$ in Figure 2.6.
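Solving for the mission time generally has to be done numerically, since the redundant reliability function has no simple inverse. The sketch below uses bisection; the function names, the search bound `t_hi` and the sample reliability functions are our own assumptions.

```python
import math

def mission_time(reliability, r_min, t_hi=1.0e7, tol=1e-6):
    """Largest t (hours) with reliability(t) >= r_min, found by bisection.

    Assumes reliability is decreasing in t, reliability(0) >= r_min and
    reliability(t_hi) < r_min.
    """
    lo, hi = 0.0, t_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if reliability(mid) >= r_min:
            lo = mid
        else:
            hi = mid
    return lo

def mti(reliability_1, reliability_2, r_min):
    """Mission time improvement of system one over system two at r_min."""
    return mission_time(reliability_1, r_min) / mission_time(reliability_2, r_min)

# Single LSI-11 processor, 109.0 failures per 10^6 hours (Section 2.1).
def lsi11(t):
    return math.exp(-109.0e-6 * t)

t_lsi11 = mission_time(lsi11, 0.9)
```

For two purely exponential systems the MTI is simply the ratio of failure rates and does not depend on the chosen minimum reliability; for redundant systems it varies with the reliability level, which is why it is plotted as a curve.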

In  Figure  2.7  we  plot  the  MTI  of  C.mmp  over  LSI-11.  In  order  to  compare  the 
MTI  with  that  of  Cm*,  the  memory  size  of  each  port  of  C.mmp  was  normalized  to  16K 
words.  The  Cm*  system  of  Figure  2.6  is  seen  to  offer  greater  mission  times  than 
C.mmp.  As  work  progresses,  the  reliability  model  will  be  refined  and  compared  against 
actual  operational  data. 

So  far  we  have  considered  the  computer  systems  as  stand-alone  systems  that 
fail  upon  component  exhaustion.  In  the  following  section,  we  will  investigate  the  effect 
of  periodic  maintenance  on  mission  time. 
3. Effect of Periodic Maintenance on Reliability


3.1.  Introduction 

In an attempt to increase the life of a non-perfect system, various redundancy
techniques have been applied. It has been shown that TMR (triple modular redundancy)
with stand-by sparing can be used to achieve improvement in reliability and expected
life. While the general analysis holds perfectly well for systems performing vital
functions in a space mission, it is extremely pessimistic in a more commonly used
system where a technician may be able to perform repairs. Even more common is a
situation in which certain tests are applied to the system at regular intervals to ensure
its integrity, followed by any required repair. It is to be expected in most cases that
the life span of a system subjected to such maintenance would be greater than that of
an identical system left unattended. In the following analysis we estimate the expected
life of a non-perfect system under periodic maintenance.

3.2. Life of An Unmaintained System

As has been the common practice, we will assume the failures in a
non-redundant system to have an exponential distribution. We will denote the failure rate
by $\lambda$. The reliability of the system (i.e. the probability that there is no failure during
the time interval (0,t)) is then $R(t) = e^{-\lambda t}$. The life of the system is estimated as
follows.

Let T be the time at which a failure occurs. Then the distribution function
F(t) is

$$ F(t) = \text{Prob}(\text{the failure occurring in } (0,t)) = \text{Prob}(0 < T < t) = 1 - \text{Prob}(T > t) = 1 - R(t) \tag{3.1} $$

The life of the system is the expected value of T, E(T), and

$$ E(T) = \int_0^\infty (1 - F(t)) \, dt = \int_0^\infty R(t) \, dt \tag{3.2} $$

Note that the above equation is a general one, and may be applied to any
function R(t). Also note that according to the equation above, E(T), or the life of the
system, is the same as the area under the curve for any function R(t). For the
exponential distribution,

$$ E(T) = 1/\lambda. \tag{3.3} $$
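Because the expected life is just the area under the reliability curve, it can be estimated numerically for any R(t). The following sketch uses a plain trapezoidal rule; the function name, the truncation point and the sample failure rate are assumptions made for illustration.

```python
import math

def expected_life(reliability, t_max, steps=200_000):
    """Trapezoidal estimate of the area under reliability(t) on (0, t_max).

    t_max must be large enough that reliability(t_max) is negligible.
    """
    h = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    total += sum(reliability(i * h) for i in range(1, steps))
    return total * h

lam = 1.0e-4  # hypothetical failure rate, failures per hour
life = expected_life(lambda t: math.exp(-lam * t), t_max=50.0 / lam)
```

For the exponential case the estimate agrees with 1/lam, here 10000 hours, to within the integration error.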


3.3.  Life  of  A Maintained  System 


Let us now consider a system being operated under periodic maintenance. We
assume that certain tests are performed at a regular time interval, $\beta$. On every occasion
on which these tests are performed with success (or the failures of the tests are
followed by subsequent necessary repairs), there is a lesser probability of failure in
the immediate future. This leads to an increase in reliability after every maintenance
period. Note that this model differs from a repair model. In the former, maintenance
and repair are conducted periodically, and the system is considered failed if there is
any system failure between maintenance periods. The system is considered unavailable
during maintenance and the duration of maintenance is considered unimportant. The
latter model only schedules repair after a failure. This distinction is particularly
significant in redundant systems where the repair model has a large number of states
and is critically sensitive to the repair rate [ShooM68]. We shall consider two models
for the improvement obtained by periodic maintenance.
Model I: Improvement through maintenance = $d (1 - R_{sys}(\beta))$, where d is a constant,
0<d<1, and corresponds to the fraction of failures detected by the diagnostic
procedure. This model assumes that a constant fraction of the lost reliability is
recovered. Figure 3.1 shows a hypothetical reliability function using this assumption.

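Let $R_i(t)$ = system reliability function during the ith interval.

Then,

$$ R_1(t) = R_{sys}(t) \tag{3.4} $$

$$ R_2(t) = \{ R_{sys}(\beta) + d (1 - R_{sys}(\beta)) \} R_{sys}(t) $$

$$ R_3(t) = \{ [ R_{sys}(\beta) + d (1 - R_{sys}(\beta)) ] R_{sys}(\beta) + d (1 - R_{sys}(\beta)) \} R_{sys}(t) = \{ R_{sys}^2(\beta) + d (1 - R_{sys}^2(\beta)) \} R_{sys}(t) $$

And,

$$ R_{i+1}(t) = \{ R_{sys}^i(\beta) + d (1 - R_{sys}^i(\beta)) \} R_{sys}(t) \tag{3.5} $$

The area under the (i+1)st segment is

$$ A_{i+1} = \int_0^\beta \{ R_{sys}^i(\beta) + d (1 - R_{sys}^i(\beta)) \} R_{sys}(t) \, dt = \{ R_{sys}^i(\beta) + d (1 - R_{sys}^i(\beta)) \} \int_0^\beta R_{sys}(t) \, dt \tag{3.6} $$

For a nonredundant system, $R_{sys}(t) = e^{-\lambda t}$, and

$$ A_{i+1} = (1/\lambda) (1 - e^{-\lambda\beta}) \{ e^{-i\lambda\beta} + d (1 - e^{-i\lambda\beta}) \} \tag{3.7} $$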

The attempt to evaluate life, which is also the sum of all $A_i$'s, yields infinity for
an answer. This is due to the fact that under the assumptions, the reliability function
reaches a steady state as shown in Figure 3.2. The area under this reliability function
is not finite.

Since  the  tests  will  not,  in  general,  test  all  possible  system  components,  there 
will  remain  components  that  are  never  replaced  or  repaired  during  the  periodic 
maintenance.  These  components  will  eventually  fail.  Thus  an  expected  infinite  life 
span  is  unrealistic. 
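The divergence under Model I is easy to see numerically for a nonredundant system: the segment areas settle at a positive constant instead of shrinking to zero. The sketch below uses hypothetical values for the failure rate, the maintenance interval and d.

```python
import math

def model1_segment_area(i, lam, beta, d):
    """Area under segment i + 1 (i = 0, 1, 2, ...) for a nonredundant system
    with failure rate lam, maintenance interval beta and detection fraction d."""
    r_i = math.exp(-i * lam * beta)
    return (1.0 - math.exp(-lam * beta)) / lam * (r_i + d * (1.0 - r_i))

lam, beta, d = 1.0e-4, 1000.0, 0.9  # hypothetical values
areas = [model1_segment_area(i, lam, beta, d) for i in range(2000)]

# Steady-state area per segment; every segment is at least this large.
limit = d * (1.0 - math.exp(-lam * beta)) / lam
```

Since each segment contributes at least `limit`, the partial sums grow linearly with the number of maintenance periods and the total has no finite value.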


Model II: In the first model, we assumed the improvement due to maintenance to be a
constant. In a more pessimistic model, we may assume that the improvement in
reliability due to maintenance at the end of the ith period is a fraction of the reliability
lost in the ith period. In other words,

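$$ R_{i+1}(t) = \{ R_i(\beta) + d R_i(0) (1 - R_{sys}(\beta)) \} R_{sys}(t) \tag{3.8} $$

Writing $R_i$'s explicitly, we have

$$ R_1(t) = R_{sys}(t) \tag{3.9} $$

$$ R_2(t) = \{ R_{sys}(\beta) + d (1 - R_{sys}(\beta)) \} R_{sys}(t) $$

$$ R_3(t) = \{ R_{sys}^2(\beta) + d R_{sys}(\beta) (1 - R_{sys}(\beta)) + d R_{sys}(\beta) + d^2 (1 - R_{sys}(\beta)) - d R_{sys}^2(\beta) - d^2 R_{sys}(\beta) (1 - R_{sys}(\beta)) \} R_{sys}(t) = \{ R_{sys}(\beta) + d (1 - R_{sys}(\beta)) \}^2 R_{sys}(t) $$

And, in general,

$$ R_{i+1}(t) = \{ R_{sys}(\beta) + d (1 - R_{sys}(\beta)) \}^i R_{sys}(t) \tag{3.10} $$

The area under the (i+1)st segment,

$$ A_{i+1} = \int_0^\beta R_{i+1}(t) \, dt = \{ R_{sys}(\beta) + d (1 - R_{sys}(\beta)) \}^i \int_0^\beta R_{sys}(t) \, dt \tag{3.11} $$

Then the life span for the system is

$$ \text{Life} = \sum_{i=0}^{\infty} A_{i+1} = \left( \int_0^\beta R_{sys}(t) \, dt \right) \sum_{i=0}^{\infty} \{ R_{sys}(\beta) + d (1 - R_{sys}(\beta)) \}^i $$

$$ = \left( \int_0^\beta R_{sys}(t) \, dt \right) \{ 1 - R_{sys}(\beta) - d (1 - R_{sys}(\beta)) \}^{-1} $$

$$ = \left( \int_0^\beta R_{sys}(t) \, dt \right) (1 - R_{sys}(\beta))^{-1} (1 - d)^{-1} \tag{3.12} $$

Again, for a nonredundant system, $R_{sys}(t) = e^{-\lambda t}$, and

$$ \text{Life} = 1 / \{ \lambda (1 - d) \} \tag{3.13} $$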

The life span is thus improved by a factor of $(1 - d)^{-1}$. If d = 1, the maintenance
is perfect and the life span tends to be infinite. If d = 0, we revert back to an
unmaintained system with life span $1/\lambda$. For a detection probability of d = 0.9, a fairly
modest goal, the expected life increases by a factor of 10.
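Under Model II the segment areas form a geometric series, so the life can be computed either from the closed form or by summing segments directly. The following sketch does both for a nonredundant system; the function name and the numerical values are hypothetical.

```python
import math

def model2_life(lam, beta, d, segments=None):
    """Expected life of a nonredundant system under Model II maintenance.

    With segments=None the closed form of the geometric series is used;
    otherwise the first `segments` segment areas are summed explicitly.
    """
    r_beta = math.exp(-lam * beta)
    first_area = (1.0 - r_beta) / lam          # area under the first segment
    q = r_beta + d * (1.0 - r_beta)            # ratio between successive segments
    if segments is None:
        return first_area / (1.0 - q)
    return first_area * sum(q ** i for i in range(segments))

lam, beta = 1.0e-4, 1000.0  # hypothetical values
closed = model2_life(lam, beta, d=0.9)
summed = model2_life(lam, beta, d=0.9, segments=5000)
```

For a nonredundant system the result does not depend on the maintenance interval: the closed form reduces to 1 / (lam * (1 - d)), which is ten times the unmaintained life when d = 0.9.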


3.4. Redundant Systems With Maintenance


Until now we have considered a non-redundant system with $R(t) = e^{-\lambda t}$. Where
reliabilities of high order (better than one failure in a million hours) are required,
improvements through technological advances alone fall short of the objective. System
designers have resorted to redundancy to attain higher reliabilities. Triple modular
redundancy (TMR) with majority voting is one of the redundancy techniques used.
Since such a technique allows the system to tolerate a single failure in any of the
three modules, the reliability of a TMR system is

$$ R_{sys}(t) = R^3(t) + 3 R^2(t) (1 - R(t)) = 3 R^2(t) - 2 R^3(t) \tag{3.14} $$

where R(t) = reliability of the non-redundant system.


Since we intend to keep the development applicable in general, we will use
$R_{sys}(t)$ during the discussion and substitute the expression for a TMR system only to
exemplify the results.


3.5. Life of An Unmaintained, Redundant (TMR) System


Recalling that the life span of a system, E(T), is the same as the area
under the reliability function we can write

$$ \text{Life} = \int_0^\infty R_{sys}(t) \, dt $$

For a TMR system,

$$ \text{Life (TMR)} = \int_0^\infty \{ 3 R^2(t) - 2 R^3(t) \} \, dt = (1/\lambda) ( 3/2 - 2/3 ) = 5 / (6 \lambda) \tag{3.15} $$
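The TMR reliability expression and the resulting unmaintained life can be verified numerically. The sketch below is illustrative; the function names are our own and the failure rate matches the value used later in Table 3.1.

```python
import math

def tmr_reliability(r):
    """Reliability of a majority-voted triple of modules, each of reliability r."""
    return 3.0 * r ** 2 - 2.0 * r ** 3

def tmr_life(lam):
    """Closed-form expected life of an unmaintained TMR system."""
    return 5.0 / (6.0 * lam)

def area_under(reliability, t_max, steps=200_000):
    """Trapezoidal estimate of the area under reliability(t) on (0, t_max)."""
    h = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    total += sum(reliability(i * h) for i in range(1, steps))
    return total * h

lam = 1.0e-4
numeric = area_under(lambda t: tmr_reliability(math.exp(-lam * t)), t_max=50.0 / lam)
print(f"{tmr_life(lam):.2f} hours")  # prints 8333.33 hours
```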

We now turn our attention again to a system under periodic maintenance.

3.6. Life of A Maintained, Redundant (TMR) System

We will again consider the two models. Under the first simple model, we assume
that the improvement through maintenance is a constant fraction of the probability of
failure, i.e. $d (1 - R_{sys}(\beta))$ where $\beta$ is the period of the maintenance cycle.

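We have established that the area under the (i+1)st segment is:

$$ A_{i+1} = \{ R_{sys}^i(\beta) + d (1 - R_{sys}^i(\beta)) \} \int_0^\beta R_{sys}(t) \, dt $$

For a TMR system,

$$ R_{sys}(t) = 3 e^{-2\lambda t} - 2 e^{-3\lambda t} $$

and

$$ \int_0^\beta R_{sys}(t) \, dt = (3 / 2\lambda) (1 - e^{-2\lambda\beta}) - (2 / 3\lambda) (1 - e^{-3\lambda\beta}) $$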

We again note that when we try to sum all $A_i$'s, the result approaches infinity.
Considering the second model that we suggested earlier, where the improvement
in reliability is a fraction of the reliability lost in the ith period, we have
established that the area under the (i+1)st segment is

$$ A_{i+1} = \{ R_{sys}(\beta) + d (1 - R_{sys}(\beta)) \}^i \int_0^\beta R_{sys}(t) \, dt $$

And the life span for the system is

$$ \text{Life} = \left( \int_0^\beta R_{sys}(t) \, dt \right) (1 - R_{sys}(\beta))^{-1} (1 - d)^{-1} $$

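For a TMR system,

$$ \int_0^\beta R_{sys}(t) \, dt = (3 / 2\lambda) (1 - e^{-2\lambda\beta}) - (2 / 3\lambda) (1 - e^{-3\lambda\beta}) $$

and $R_{sys}(\beta) = 3 e^{-2\lambda\beta} - 2 e^{-3\lambda\beta}$.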

Substituting  these  two  expressions  in  the  equation  for  Life  we  can  estimate  the 
life  span  of  a TMR  system.  As  the  expression  is  very  complex,  the  improvement  is  not 
obvious  in  this  form.  Let  us,  therefore,  consider  numerical  examples.  The  following 
table  (Table  3.1)  allows  the  comparison  in  the  life  spans  of  unmaintained  and 
maintained  TMR  systems. 

Table 3.1  Expected life with periodic maintenance; $\lambda$ = 0.0001
Life of an unmaintained TMR system: 8333.33 hours

                                 $\beta$ (hours)
  d              10          100         1000        10000      1000000

0.20      423619.13     48692.87     11958.48     10416.67     10416.67
0.40      564825.50     64923.83     15944.64     13888.89     13888.89
0.60      847238.26     97385.74     23916.96     20833.33     20833.33
0.80     1694476.49    194771.48     47833.92     41666.67     41666.67
0.90     3388952.97    389542.96     95667.84     83333.33     83333.33
0.92     4236191.38    486928.72    119584.80    104166.67    104166.67
0.94     5648254.82    649238.25    159446.39    138888.88    138888.88
0.96     8472382.75    973857.43    239169.60    208333.34    208333.34
0.98    16944762.34   1947714.50    478339.11    416666.60    416666.60

The data conform to expectations. As the period between maintenance checks
becomes larger, the expected life becomes shorter. We note that beyond β = 10000
hours, the life is shorter than the period of maintenance. Thus the maintenance has no
effect on the performance for any period larger than 10000 hours. The life depends more
strongly on d, as seen in the sharp increase of life as d approaches unity. The
constant d is a function of how efficient the maintenance routines are. Since the expected
life is extremely sensitive to the effectiveness of maintenance around d = 0.9, a slight
improvement in maintenance routines may be rewarded with a substantially greater life
for the system. This fact is emphasized when the expected life of a TMR
system is plotted in Figure 3.3.
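The life expression for a maintained TMR system is easy to evaluate numerically. The following is a minimal sketch, assuming the second maintenance model and the exponential TMR reliability given above. The function names are illustrative and do not come from the original report; the parameters lam, beta and d stand for λ, β and d.

```python
import math


def tmr_reliability(t, lam):
    """R_sys(t) = 3 exp(-2 lam t) - 2 exp(-3 lam t) for a TMR system."""
    return 3.0 * math.exp(-2.0 * lam * t) - 2.0 * math.exp(-3.0 * lam * t)


def tmr_area(beta, lam):
    """Closed-form integral of R_sys(t) over one maintenance period [0, beta]."""
    # -expm1(-x) equals 1 - exp(-x) and stays accurate for small x.
    first = (3.0 / (2.0 * lam)) * -math.expm1(-2.0 * lam * beta)
    second = (2.0 / (3.0 * lam)) * -math.expm1(-3.0 * lam * beta)
    return first - second


def tmr_unmaintained_life(lam):
    """Equation (3.15): 5 / (6 lam)."""
    return 5.0 / (6.0 * lam)


def tmr_maintained_life(beta, d, lam):
    """Area of one period divided by (1 - R_sys(beta)) (1 - d)."""
    unreliability = 1.0 - tmr_reliability(beta, lam)
    return tmr_area(beta, lam) / (unreliability * (1.0 - d))


if __name__ == "__main__":
    lam = 1.0e-4  # failures per hour, as in Table 3.1
    print("unmaintained:", round(tmr_unmaintained_life(lam), 2))
    for d in (0.2, 0.9):
        for beta in (1.0e3, 1.0e4, 1.0e6):
            print(d, beta, round(tmr_maintained_life(beta, d, lam), 2))
```

When β is much longer than the unmaintained life, R_sys(β) is negligible and the expression tends to 5 / (6 λ (1 - d)), for example 10416.67 hours for d = 0.2. This is the value at which the right-hand columns of Table 3.1 level off.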
3.7. Life of C.mmp and Cm* Systems


Now let us examine the effect of periodic maintenance on the two systems under
investigation. We will use the general expression for the system life with periodic
maintenance (equation (3.12)), namely,

    Life = ( \int_0^\beta R_{sys}(t) \, dt ) (1 - R_{sys}(\beta))^{-1} (1 - d)^{-1}

For C.mmp with 16 processors and 16 64-K memories, and with at least four
processors and four memory ports required for a task, R_sys is given by equation
(2.2):

    R_{sys} = ( \sum_{i=0}^{12} \binom{16}{i} R_p^{16-i} (1 - R_p)^i ) ( \sum_{i=0}^{12} \binom{16}{i} R_m^{16-i} (1 - R_m)^i )
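The two binomial sums can be evaluated directly. The sketch below assumes independent, identical units with exponential reliability and sums over i = 0 to 12 failed units, that is, at least 4 of 16 still working. The failure rates LAM_P and LAM_M are hypothetical placeholders, not the parts-count values of the report, and the function names are illustrative. Only the two bracketed sums are modeled; a non-redundant element such as the switch would enter as a further factor and is not included here.

```python
import math


def at_least_k_of_n(n, k, r):
    """Probability that at least k of n independent units are working.

    Each unit has reliability r; the sum runs over i = 0 .. n-k failed units.
    """
    return sum(
        math.comb(n, i) * r ** (n - i) * (1.0 - r) ** i
        for i in range(n - k + 1)
    )


def product_reliability(t, lam_p, lam_m, n=16, k=4):
    """Processor term times memory term at time t (hours)."""
    r_p = math.exp(-lam_p * t)  # one processor, assumed exponential
    r_m = math.exp(-lam_m * t)  # one memory, assumed exponential
    return at_least_k_of_n(n, k, r_p) * at_least_k_of_n(n, k, r_m)


if __name__ == "__main__":
    LAM_P = 1.0e-4  # hypothetical processor failure rate per hour
    LAM_M = 2.0e-4  # hypothetical memory failure rate per hour
    for t in (1000.0, 10000.0, 20000.0):
        print(t, product_reliability(t, LAM_P, LAM_M))
```

With n = 3 and k = 2 the same function reduces to the TMR expression 3 R^2 - 2 R^3, which gives a quick consistency check.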

Substitution  in  equation  (3.12)  and  numerical  evaluation  of  life  yielded  the 
following  results. 


Table 3.2  Expected life (hours) of a maintained C.mmp system

                            β (hours)
     d           100          1000         10000        100000
  0.20       6175.80       6175.80       6158.50       5992.46
  0.40       8234.40       8234.40       8211.33       7989.95
  0.60      12351.60      12351.60      12316.99      11984.93
  0.80      24703.19      24703.19      24633.98      23969.85
  0.90      49406.38      49406.38      49267.97      47939.71
  0.92      61757.98      61757.98      61584.96      59924.63
  0.94      82343.97      82343.97      82113.28      79899.51
  0.96     123515.97     123515.96     123169.92     119849.27
  0.98     247031.88     247031.87     246339.79     239698.49

A significant  fact  comes  to  light  in  this  exercise.  There  is  a small  improvement  in 
the  system  life  as  the  period  of  maintenance  is  reduced  from  100000  to  10000  to 
1000.  But  when  it  is  further  reduced  to  100  hours,  there  is  no  improvement  for 

    Life = \{ \lambda_k (1 - d) \}^{-1} \{ (1 - e^{-\lambda_k \beta})^{-1} + 1/2 \}        (3.17)


Note that (3.17) represents the life of a duplicate system under periodic
maintenance. The life of a duplicate system without periodic maintenance is

    Life(duplicate system) = 3 / (2 \lambda_k)        (3.18)
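Equation (3.17) follows from the general life expression (3.12) if the duplicate system is taken to be two units in parallel, R_sys(t) = 2 e^{-\lambda_k t} - e^{-2 \lambda_k t}. This form is an assumption made here; it is chosen because it reproduces (3.18). The symbol u is a shorthand introduced only for this derivation.

```latex
\begin{align*}
u &= 1 - e^{-\lambda_k \beta} \\
1 - R_{sys}(\beta) &= 1 - 2 e^{-\lambda_k \beta} + e^{-2 \lambda_k \beta} = u^2 \\
\int_0^\beta R_{sys}(t)\,dt
  &= \frac{2u}{\lambda_k} - \frac{2u - u^2}{2 \lambda_k}
   = \frac{1}{\lambda_k} \left( u + \frac{u^2}{2} \right) \\
\text{Life}
  &= \frac{u + u^2/2}{\lambda_k \, u^2 \, (1 - d)}
   = \frac{1}{\lambda_k (1 - d)} \left( \frac{1}{u} + \frac{1}{2} \right)
\end{align*}
```

Letting β grow without bound (u → 1) with d = 0 recovers 3 / (2 λ_k), the unmaintained life of (3.18).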

To re-emphasize the effect of periodic maintenance, consider the two equations
(3.17) and (3.18). For the estimated value of λ_k (from page 8) and d = 0.9, the ratio of
the two expressions is as much as 44 for β = 1000 hours, and more than 350 for
β = 100 hours.
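The quoted ratios can be checked with a few lines of code. This is only a sketch: the value LAM_K_ASSUMED is not quoted from the report but is back-calculated here from the large-period column of Table 3.3, where (3.17) reduces to 3 / (2 λ_k (1 - d)). The function names are illustrative.

```python
import math

# Assumed failure rate per hour, inferred from Table 3.3 (d = 0.20,
# period 100000 hours); it is not the value printed in the report.
LAM_K_ASSUMED = 1.784e-4


def duplicate_maintained_life(beta, d, lam_k):
    """Equation (3.17): {lam_k (1-d)}^-1 {(1 - exp(-lam_k beta))^-1 + 1/2}."""
    u = -math.expm1(-lam_k * beta)  # 1 - exp(-lam_k * beta)
    return (1.0 / u + 0.5) / (lam_k * (1.0 - d))


def duplicate_unmaintained_life(lam_k):
    """Equation (3.18): 3 / (2 lam_k)."""
    return 3.0 / (2.0 * lam_k)


def improvement_ratio(beta, d, lam_k):
    """Ratio of (3.17) to (3.18); note that lam_k cancels except inside u."""
    return duplicate_maintained_life(beta, d, lam_k) / duplicate_unmaintained_life(lam_k)


if __name__ == "__main__":
    for beta in (1000.0, 100.0):
        print(beta, improvement_ratio(beta, 0.9, LAM_K_ASSUMED))
```

Under this assumed rate the ratio comes out near 44 for β = 1000 hours and above 350 for β = 100 hours, in line with the figures quoted in the text.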


Numerical  evaluation  of  life  as  predicted  by  equation  (3.17)  leads  to  results 
presented  in  Table  3.3. 

Table 3.3  Expected life (hours, approximate) of a maintained Cm* system

                            β (hours)
     d           100          1000         10000        100000
  0.20     399708.66      46379.50      11923.38      10509.27
  0.40     532944.88      61839.33      15897.84      14012.35
  0.60     799417.33      92759.00      23846.76      21018.53
  0.80    1598834.62     185518.00      47693.52      42037.06
  0.90    3197669.24     371036.00      95387.05      84074.12
  0.92    3997086.70     463795.01     119233.81     105092.65
  0.94    5329448.61     618393.31     158978.41     140123.53
  0.96    7994173.41     927590.03     238467.63     210185.31
  0.98   15988343.84    1855179.71     476935.16     420370.54

It is readily noticed that these numbers are considerably higher than those for
the C.mmp. While both systems are constrained by the components lacking potential
redundancy, namely the switch for C.mmp and the Line and K.map for Cm*, the
flexibility of structure in the Cm* system allows for some failures of these
components. In a failure-tolerant environment, the Cm* system would exhibit a life
between 2 and 9 times longer than that of the C.mmp system.
4. Conclusions

We  set  out  to  establish  a realistic  reliability  model  for  hardware  failures  in  two 
multiprocessor  systems,  C.mmp  and  Cm*.  A parts  count  reliability  model  was  used  to 
arrive  at  failure  rates  of  various  components  of  the  two  systems.  To  model  accurately 
the  potential  fault  tolerance  offered  by  the  multiprocessor  systems,  various 
redundancy  models  were  proposed. 

Digital systems, apart from those used in space missions, may be subjected to
maintenance checks to ensure their integrity. Such checks, although short of complete
repair, should increase the confidence in the integrity, and hence the reliability, of a system.
The effect of periodic maintenance was modeled using a parameter d, which signifies
the efficiency of the maintenance checks. The system life was shown to have a strong
dependence on d.

Finally, through the application of the composite models, which capture redundancy as
well as periodic maintenance, the two multiprocessor systems were compared. The
non-redundant components figured prominently as the bottlenecks. The flexibility of
structure in the Cm* system was reflected in its life being considerably longer than
that of the C.mmp system.
Figure 3.1: Effect of periodic maintenance on system reliability.

Figure 3.2: Steady state reached by the reliability function in the first model
for periodic maintenance.